11.7.1 Conceptual Exercises

Exercise 11.1

We want to build regression models with more than one predictor because outcomes typically depend on more than one factor. Including multiple predictors can improve a model's explanatory power and make it more useful for prediction/forecasting.

Exercise 11.2

Modeling a car’s miles per gallon in the city (Y) by the make of the car, represented by mu = B_0 + B_1*X_1 + B_2*X_2 + B_3*X_3, where X_1, X_2, and X_3 are indicators for Kia, Subaru, and Toyota.

  1. There is no indicator term for the Ford category because Ford is the baseline, or reference level, of the make variable; a Ford is represented by X_1 = X_2 = X_3 = 0.

  2. B_2 represents the typical difference in gas mileage between Subarus and the baseline Fords, i.e. a Subaru’s typical city mileage is B_0 + B_2.

  3. B_0 represents the typical gas mileage for a Ford.
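The dummy coding behind this model can be inspected directly in R. A minimal sketch using a hypothetical one-car-per-make data frame (the cars data and make variable here are made up for illustration):

```r
# Hypothetical data frame with one car per make; Ford becomes the baseline
# because factor levels default to alphabetical order.
cars <- data.frame(make = factor(c("Ford", "Kia", "Subaru", "Toyota")))

# model.matrix() shows the indicator (dummy) columns R builds for make:
# the Ford row is all zeros apart from the intercept.
model.matrix(~ make, data = cars)
```

The Kia, Subaru, and Toyota rows each turn on exactly one indicator, matching X_1, X_2, and X_3 above.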

Exercise 11.3

mu = B_0 + B_1*X_1 + B_2*X_2, where X_1 is the number of days a tomato has been growing and X_2 is an indicator for the Roma variety (Mr. Stripey is the baseline).

  1. B_0 is the typical weight of a tomato from the reference level, Mr. Stripey, at 0 days of growth. B_1 measures the typical change in weight for a 1-unit (one-day) increase in X_1, the days a tomato has been growing. B_2 measures the typical difference in weight between a Roma tomato and a Mr. Stripey tomato grown for the same number of days.

  2. If B_2 were equal to 0, then mu = B_0 + B_1*X_1 for both varieties. I.e., variety would carry no information about weight: a Roma and a Mr. Stripey grown for the same X_1 days would have the same typical weight. Weight is the outcome, and variety and days of growth are the predictors.

Exercise 11.4

  1. For X_1 and X_2 to interact means that X_1 moderates the relationship between X_2 and mu. In tomato terms, the number of days a tomato has been growing (X_1) moderates the relationship between the Roma variety indicator (X_2) and the tomato’s weight in grams (Y).

  2. B_3 is the interaction coefficient; it captures the difference in slopes between the two varieties. The Mr. Stripey slope (typical weight gained per day) is B_1, while the Roma slope is B_1 + B_3.

Exercise 11.5

  1. Sketch a model that would benefit from an interaction between a categorical and a quantitative predictor. Used View(weather_WU) in the console to browse the weather data for categorical and quantitative variables.

Exercise 11.6

Y = shoe size, X_1 = children’s age in years, X_2 = children’s swimming knowledge

  1. Generally speaking, adding predictors improves models because outcomes may be correlated with more than one predictor. Adding predictors incorporates flexibility that improves upon models with high bias.

  2. It is possible to overfit. A model with high variance and low bias aligns so closely with the observed data that it also reflects noise or error in the data, making it a worse model than one with in-between variance and in-between bias.

  3. For a model of children’s shoe size I would add children’s height in cm, assuming that height and foot length are roughly proportional. It’s also data that is feasible and probably fair to collect.

  4. I would remove children’s knowledge of swimming, as there is no reasonable causal pathway from swimming ability to variation in shoe size.

Exercise 11.7

  1. Good models capture the central tendency and range of the data they were built with. Good models have also undergone training and testing, and have bias and variance that are neither too high nor too low.

  2. Bad models can be overly simple (high bias, low variance) or overly complex (low bias, high variance), meaning they either miss key features of the data or capture too much noise/error.

Exercise 11.8

Techniques to assess models include visualization, cross-validation, and ELPD. Visualizations evaluate predictive accuracy and compare different models: pp_check() and other posterior predictive plots show how simulated data compare to the observed data and how models compare to each other. Cross-validation numerically assesses the posterior predictive quality of models; k-fold cross-validation trains the model on one subset of the data and tests it on another. ELPD (expected log-predictive density) is another numerical measure of predictive accuracy: a larger expected log posterior predictive density corresponds to more accurate posterior predictions of Y.

Exercise 11.9

The bias-variance tradeoff is the challenge of using enough predictors to get useful information about Y without including too much noise. Too few predictors yield models with low variance across samples but high bias. Too many predictors capture noise and errors, making the model unstable; such a model has high variance and low bias.

11.7.2 Applied Exercises

Load packages for applied exercises.

#loading libraries for exercises
library(bayesrules)
library(tidyverse) 
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(bayesplot)
## This is bayesplot version 1.8.1
## - Online documentation and vignettes at mc-stan.org/bayesplot
## - bayesplot theme set to bayesplot::theme_default()
##    * Does _not_ affect other ggplot2 plots
##    * See ?bayesplot_theme_set for details on theme setting
library(rstanarm)
## Loading required package: Rcpp
## This is rstanarm version 2.21.1
## - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
## - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
## - For execution on a local, multicore CPU with excess RAM we recommend calling
##   options(mc.cores = parallel::detectCores())
library(rstan)
## Loading required package: StanHeaders
## rstan (Version 2.21.3, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## 
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
## 
##     extract
library(tidybayes)
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(broom.mixed)

Using penguins_bayes to build models of body_mass_g (Y). The average penguin weighs between 3500 and 4500 g (estimated standard deviation 250 g). The species predictor (Adelie, Chinstrap, Gentoo) is categorical; we will work with a two-level version of it in penguin_data.

#loading alternative penguin data
penguin_data <- penguins_bayes |> 
  filter(species %in% c("Adelie", "Gentoo"))

Exercise 11.10

  1. Plot and summarize the relationship among the 3 variables (Y = body_mass_g, X_1 = flipper_length_mm, X_2 = species).
ggplot(data = penguin_data, aes(y = body_mass_g, x = flipper_length_mm, color = species)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

For both species, as flipper length increases, body mass increases. For Gentoo the increase is more rapid, suggesting that species may moderate the relationship between flipper length and body mass.

  2. Use stan_glm() to simulate a posterior Normal regression of body_mass_g on flipper_length_mm and species, without an interaction.
penguin_model <- stan_glm(
  body_mass_g ~ flipper_length_mm + species,
  data = penguin_data, family = gaussian, 
  prior_intercept = normal(4000, 250), # midpoint of the average weight range given in the chapter
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(0.004, autoscale = TRUE), #1/250
  chains = 4, iter = 5000*2, seed = 8000)

MCMC diagnostics

#Store 4 chains for each parameter in 1 data frame for later 
penguin_model_df <- as.data.frame(penguin_model)
#visual diagnostics 
#trace and density plots 
mcmc_trace(penguin_model, size = 0.1)  

# Density plots of parallel chains
mcmc_dens_overlay(penguin_model)

#chains look stable and density plots look consistent across chains
#numerical diagnostics 
# Effective sample size ratio and Rhat
neff_ratio(penguin_model)
##       (Intercept) flipper_length_mm     speciesGentoo             sigma 
##           0.47355           0.46930           0.47975           0.71670
rhat(penguin_model)
##       (Intercept) flipper_length_mm     speciesGentoo             sigma 
##          1.000186          1.000184          1.000119          1.000060

The R-hat values are very close to 1, meaning the chains are stable and mix quickly. The effective sample size ratios are all well above 0.1, so the chains produce an adequate number of effectively independent draws.

  3. Produce a tidy() summary of this model. Interpret the non-intercept coefficients’ posterior median values in context.
tidy(penguin_model, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.80) |> 
  select(-std.error)
## # A tibble: 5 × 4
##   term              estimate conf.low conf.high
##   <chr>                <dbl>    <dbl>     <dbl>
## 1 (Intercept)        -4379.   -5262.    -3487. 
## 2 flipper_length_mm     42.5     37.8      47.1
## 3 speciesGentoo        217.      78.2     361. 
## 4 sigma                393.     373.      416. 
## 5 mean_PPD            4316.    4273.     4358.

flipper_length_mm is positively associated with body_mass_g: each additional mm of flipper length corresponds to a typical increase of about 42.5 g in body mass. Gentoo penguins are typically about 217 g heavier than Adelie penguins with the same flipper length.

  4. Simulate, plot, and describe the posterior predictive model for the body mass of an Adelie penguin that has a flipper length of 197 mm.
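A sketch of the requested simulation, assuming the penguin_model fit above (output not shown here):

```r
# New observation: an Adelie penguin with 197 mm flippers
new_penguin <- data.frame(species = "Adelie", flipper_length_mm = 197)

# Simulate the posterior predictive model of its body mass
set.seed(8000)
adelie_prediction <- posterior_predict(penguin_model, newdata = new_penguin)

# Plot the posterior predictive distribution
mcmc_dens(adelie_prediction) + xlab("predicted body_mass_g")
```

Given the coefficients above, we would expect this distribution to be roughly Normal and centered near B_0 + 197 * B_1 (around 4000 g), with spread driven by sigma.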

Exercise 11.11 (penguins interaction)

  1. Use stan_glm() to simulate the posterior for this model with 4 chains at 10,000 iterations.
penguin_interaction_model <- stan_glm(
   body_mass_g ~ species + flipper_length_mm + species:flipper_length_mm,
  data = penguin_data, family = gaussian, 
  prior_intercept = normal(4000, 250), 
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(0.004, autoscale = TRUE), 
  chains = 4, iter = 5000*2, seed = 8000)
  2. Simulate and plot 50 posterior model lines.
pp_check(penguin_interaction_model)

The plot shows that the model captures the range and central tendency of the data well: both the observed data and the 50 simulated densities have at least two peaks. While the model doesn’t overfit the data, using multiple predictors gives a better model than flipper length alone.

For comparison, here is the pp_check() of a non-interaction model with one predictor, flipper length.

penguin_one_model <- stan_glm(
   body_mass_g ~ flipper_length_mm,
  data = penguin_data, family = gaussian, 
  prior_intercept = normal(4000, 250), 
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(0.004, autoscale = TRUE), 
  chains = 4, iter = 5000*2, seed = 8000)
pp_check(penguin_one_model)

tidy(penguin_interaction_model, effects = c("fixed", "aux"))
## # A tibble: 6 × 3
##   term                            estimate std.error
##   <chr>                              <dbl>     <dbl>
## 1 (Intercept)                      -2893.     875.  
## 2 speciesGentoo                    -3341.    1314.  
## 3 flipper_length_mm                   34.7      4.60
## 4 speciesGentoo:flipper_length_mm     17.3      6.42
## 5 sigma                              387.      17.0 
## 6 mean_PPD                          4316.      33.0
posterior_interval(penguin_interaction_model, prob = 0.80,
                   pars = "speciesGentoo:flipper_length_mm")
##                                      10%      90%
## speciesGentoo:flipper_length_mm 9.174402 25.60938
penguin_data |> drop_na() |> 
  add_fitted_draws(penguin_interaction_model, n = 200) |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_line(aes(y=.value, group = paste(species, .draw)), alpha = 0.1)
## Warning: `fitted_draws` and `add_fitted_draws` are deprecated as their names were confusing.
## Use [add_]epred_draws() to get the expectation of the posterior predictive.
## Use [add_]linpred_draws() to get the distribution of the linear predictor.
## For example, you used [add_]fitted_draws(..., scale = "response"), which
## means you most likely want [add_]epred_draws(...).

Based on the summary and this visualization, I would say the interaction term is necessary. Species moderates the relationship between flipper_length_mm and body_mass_g: being Gentoo (blue) rather than Adelie strengthens the positive relationship between flipper_length_mm and body_mass_g. The 80% posterior credible interval for the interaction coefficient lies entirely above 0, ranging from 9.17 to 25.61.

Exercise 11.12

Three predictors and no interactions: flipper_length_mm, bill_length_mm, bill_depth_mm.

penguin_model_3 <- stan_glm(
  body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm ,
  data = penguin_data, family = gaussian, 
  prior_intercept = normal(4000, 250), 
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(0.004, autoscale = TRUE), 
  chains = 4, iter = 5000*2, seed = 8000)

# Confirm prior specification
prior_summary(penguin_model_3)

# Check MCMC diagnostics
mcmc_trace(penguin_model_3)

mcmc_dens_overlay(penguin_model_3)

mcmc_acf(penguin_model_3)

posterior_interval(penguin_model_3, prob = 0.95)
##                          2.5%       97.5%
## (Intercept)       -7249.42574 -5028.55451
## flipper_length_mm    26.11961    38.23013
## bill_length_mm       55.62631    87.56523
## bill_depth_mm        27.55968    80.22603
## sigma               312.67673   370.20147
  2. All three predictors have a significant positive association with body mass: none of the 95% credible intervals for the model parameters overlap 0. Comparing all predictors of body_mass_g (below), we see that year has a negative association with body_mass_g, while island has no significant association.
penguin_model_4 <- stan_glm(
  body_mass_g ~ .,
  data = penguin_data, family = gaussian, 
  prior_intercept = normal(4000, 250), 
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(0.004, autoscale = TRUE), 
  chains = 4, iter = 5000*2, seed = 8000)
posterior_interval(penguin_model_4, prob = 0.95)
##                               2.5%         97.5%
## (Intercept)           16967.071808 170943.720217
## speciesGentoo           332.968842    852.046816
## islandDream            -123.291986     68.918310
## islandTorgersen        -141.008733     58.292925
## year                    -85.827030     -8.837732
## bill_length_mm            8.169976     35.197786
## bill_depth_mm            18.796586     90.250796
## flipper_length_mm         7.792649     19.496381
## above_average_weight1   387.550791    591.059962
## sexmale                 292.640265    470.133041
## sigma                   219.832953    262.245305

Exercise 11.13

  1. Simulate 4 models with the stan_glm() function.
#model 1
penguin_model_1 <- stan_glm(
  body_mass_g ~ flipper_length_mm,
  data = penguin_data, family = gaussian, 
  prior_intercept = normal(4000, 250), 
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(0.004, autoscale = TRUE), 
  chains = 4, iter = 5000*2, seed = 8000)

#model 2
penguin_model_2 <- stan_glm(
  body_mass_g ~ species,
  data = penguin_data, family = gaussian, 
  prior_intercept = normal(4000, 250), 
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(0.004, autoscale = TRUE), 
  chains = 4, iter = 5000*2, seed = 8000)

#model 3
penguin_model_3 <- stan_glm(
  body_mass_g ~ flipper_length_mm + species,
  data = penguin_data, family = gaussian, 
  prior_intercept = normal(4000, 250), 
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(0.004, autoscale = TRUE), 
  chains = 4, iter = 5000*2, seed = 8000)

#model 4 
penguin_model_4 <- stan_glm(
  body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm,
  data = penguin_data, family = gaussian, 
  prior_intercept = normal(4000, 250), 
  prior = normal(0, 2.5, autoscale = TRUE), 
  prior_aux = exponential(0.004, autoscale = TRUE), 
  chains = 4, iter = 5000*2, seed = 8000)
  2. Produce pp_check() plots for the 4 simulations.
pp_check(penguin_model_1)

pp_check(penguin_model_2)

pp_check(penguin_model_3)

pp_check(penguin_model_4)

#creating data set without missing data
penguins_comp <- penguin_data |> 
  select(flipper_length_mm, body_mass_g, species, 
         bill_length_mm, bill_depth_mm) |> 
  na.omit()
#cross validation of four models 
set.seed(8000)
prediction_summary_cv(model = penguin_model_1, data = penguins_comp, k = 10)
## $folds
##    fold      mae mae_scaled within_50 within_95
## 1     1 246.8452  0.6274958 0.5357143 0.9285714
## 2     2 304.5235  0.7694913 0.4074074 0.9629630
## 3     3 248.7804  0.6151063 0.5555556 0.9629630
## 4     4 222.7751  0.5595664 0.5357143 0.9642857
## 5     5 348.6981  0.8869174 0.3333333 0.9259259
## 6     6 215.6457  0.5382611 0.6666667 0.9259259
## 7     7 300.2348  0.7666828 0.4285714 0.9285714
## 8     8 304.6348  0.7619095 0.4444444 1.0000000
## 9     9 203.3990  0.5042036 0.5925926 0.9629630
## 10   10 226.4402  0.5581455 0.5714286 0.9642857
## 
## $cv
##        mae mae_scaled within_50 within_95
## 1 262.1977   0.658778 0.5071429 0.9526455
set.seed(8000)
prediction_summary_cv(model = penguin_model_2, data = penguins_comp, k = 10)
## $folds
##    fold      mae mae_scaled within_50 within_95
## 1     1 360.4104  0.7543538 0.4285714 0.8928571
## 2     2 403.6032  0.8329245 0.3333333 1.0000000
## 3     3 437.2558  0.9064028 0.2962963 1.0000000
## 4     4 286.3682  0.5941468 0.5714286 0.9642857
## 5     5 441.2444  0.9101616 0.4814815 0.9259259
## 6     6 365.2359  0.7502957 0.4814815 0.9629630
## 7     7 280.3658  0.5804143 0.6071429 0.9285714
## 8     8 526.8396  1.1137135 0.2592593 1.0000000
## 9     9 311.6731  0.6344041 0.5555556 0.9629630
## 10   10 317.0627  0.6399031 0.5357143 1.0000000
## 
## $cv
##        mae mae_scaled within_50 within_95
## 1 373.0059   0.771672 0.4550265 0.9637566
set.seed(8000)
prediction_summary_cv(model = penguin_model_3, data = penguins_comp, k = 10)
## $folds
##    fold      mae mae_scaled within_50 within_95
## 1     1 261.6374  0.6649825 0.5000000 0.9285714
## 2     2 317.8054  0.7972312 0.4074074 0.9629630
## 3     3 308.3408  0.7605406 0.3703704 0.9629630
## 4     4 208.8396  0.5273125 0.6071429 0.9642857
## 5     5 399.5042  1.0394804 0.2962963 0.9629630
## 6     6 226.1543  0.5715921 0.6666667 0.9259259
## 7     7 324.0330  0.8287271 0.4285714 0.9285714
## 8     8 331.1047  0.8313187 0.2962963 1.0000000
## 9     9 225.0195  0.5639652 0.5925926 0.9629630
## 10   10 207.6343  0.5124322 0.6071429 0.9642857
## 
## $cv
##        mae mae_scaled within_50 within_95
## 1 281.0073  0.7097582 0.4772487 0.9563492
set.seed(8000)
prediction_summary_cv(model = penguin_model_4, data = penguins_comp, k = 10)
## $folds
##    fold      mae mae_scaled within_50 within_95
## 1     1 251.3421  0.7481516 0.4285714 0.8928571
## 2     2 214.7109  0.6247227 0.5185185 1.0000000
## 3     3 232.5114  0.6652209 0.5185185 1.0000000
## 4     4 180.3222  0.5208608 0.6071429 0.8928571
## 5     5 231.0877  0.6818039 0.4814815 0.8888889
## 6     6 149.1957  0.4362503 0.5925926 0.8888889
## 7     7 280.3003  0.8412392 0.4285714 0.8928571
## 8     8 220.2031  0.6302804 0.5555556 1.0000000
## 9     9 237.0537  0.6880010 0.4814815 0.9629630
## 10   10 158.7289  0.4522308 0.6071429 0.9642857
## 
## $cv
##        mae mae_scaled within_50 within_95
## 1 215.5456  0.6288762 0.5219577 0.9383598
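By cross-validated MAE, penguin_model_4 (the three quantitative predictors) predicts best: its MAE of about 216 g beats models 1–3 (262, 373, and 281 g respectively). We could also compare the models by ELPD. A sketch assuming the four fits above; note that loo_compare() requires the models to be fit to the same observations, so the fits may first need to be re-run on the complete-case data in penguins_comp:

```r
# Estimate leave-one-out ELPD for each model
loo_1 <- loo(penguin_model_1)
loo_2 <- loo(penguin_model_2)
loo_3 <- loo(penguin_model_3)
loo_4 <- loo(penguin_model_4)

# Rank the models; higher elpd_loo means better expected predictive accuracy
loo_compare(loo_1, loo_2, loo_3, loo_4)
```

If the ELPD ranking agrees with the cross-validation results above, that strengthens the case for the three-quantitative-predictor model.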